Fix search_graph name_pattern= performance: regex cache, LIKE pre-filter, cheap count by awconstable · Pull Request #300 · DeusData/codebase-memory-mcp

awconstable · 2026-04-30T07:57:37Z

Fixes #254

Root cause

Three compounding bugs caused name_pattern= searches to scan every node with an expensive compiled regex, regardless of how selective the pattern was:

sqlite_iregexp / sqlite_regexp recompiled the regex on every row — cbm_regcomp + cbm_regfree fired once per node for the full table.
The count query wrapped the full SELECT (including two correlated edge-count subqueries per row) in SELECT COUNT(*) FROM (...), doubling the scan with identical per-row overhead.
cbm_extract_like_hints was implemented and correct but never called — the LIKE pre-filter that should cut the regex scan to only matching rows was dead code.

Changes

Fix 1 — regex cached per statement (sqlite_regexp / sqlite_iregexp)
Use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. cbm_regcomp is now called exactly once per query, not once per row.
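
To illustrate the effect of Fix 1, here is a minimal Python sketch rather than the PR's C code. Python's sqlite3 module does not expose sqlite3_get_auxdata / sqlite3_set_auxdata, so a small functools.lru_cache keyed on the pattern text stands in for the per-statement cache. The table layout, the function names regexp_naive and regexp_cached, and the compile counters are all assumptions made for the demonstration.

```python
import re
import sqlite3
from functools import lru_cache

# Counters that record how often each variant compiles the pattern.
compile_calls = {"naive": 0, "cached": 0}


def regexp_naive(pattern, value):
    # Mirrors the old behaviour: one compile per row scanned.
    compile_calls["naive"] += 1
    return re.compile(pattern).search(value) is not None


@lru_cache(maxsize=32)
def _compile_once(pattern):
    # Stand-in for the auxdata slot: the compiled regex is kept and reused.
    compile_calls["cached"] += 1
    return re.compile(pattern)


def regexp_cached(pattern, value):
    return _compile_once(pattern).search(value) is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO nodes (name) VALUES (?)",
    [
        (f"Thing{i}Controller" if i % 10 == 0 else f"helper{i}",)
        for i in range(1000)
    ],
)
conn.create_function("regexp_naive", 2, regexp_naive)
conn.create_function("regexp_cached", 2, regexp_cached)

naive_hits = conn.execute(
    "SELECT COUNT(*) FROM nodes WHERE regexp_naive(?, name)",
    (".*Controller.*",),
).fetchone()[0]
cached_hits = conn.execute(
    "SELECT COUNT(*) FROM nodes WHERE regexp_cached(?, name)",
    (".*Controller.*",),
).fetchone()[0]

print(naive_hits, cached_hits, compile_calls)
```

Both queries return the same matches; the counters show how many times each variant compiled the pattern while scanning the table.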

Fix 2 — LIKE pre-filter wired in (where_add_like_hints, search_where_basic)
Wire cbm_extract_like_hints into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%'; the idx_nodes_name index satisfies the LIKE clause and only matching rows reach iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32.
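
The pre-filter idea in Fix 2 can be sketched in Python as follows. The extract_like_hint function below is a hypothetical, deliberately conservative stand-in for cbm_extract_like_hints, whose real rules are not shown in this PR text: it only handles patterns built from plain literals and `.*`, and gives up on anything else. The schema, the regexp_demo function, and the call counter are likewise assumptions. The LIKE clause only needs to match a superset of what the regex matches, because the regex still runs on the surviving rows.

```python
import re
import sqlite3

REGEX_META = set("\\^$.|?*+()[]{}")


def extract_like_hint(pattern):
    """Return a LIKE pattern matching a superset of `pattern`, or None."""
    parts = pattern.split(".*")
    # Any remaining metacharacter means the pattern is too complex for
    # this sketch, so no hint is produced and the regex runs unfiltered.
    if any(ch in REGEX_META for part in parts for ch in part):
        return None
    literals = [part for part in parts if part]
    if not literals:
        return None
    return "%" + "%".join(literals) + "%"


stats = {"regex_calls": 0}


def regexp_demo(pattern, value):
    stats["regex_calls"] += 1
    return re.search(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_nodes_name ON nodes(name)")
conn.executemany(
    "INSERT INTO nodes (name) VALUES (?)",
    [
        (f"Thing{i}Controller" if i % 10 == 0 else f"helper{i}",)
        for i in range(1000)
    ],
)
conn.create_function("regexp_demo", 2, regexp_demo)


def search(pattern, use_hint):
    """Run the search and report (matching names, regex invocations)."""
    stats["regex_calls"] = 0
    hint = extract_like_hint(pattern) if use_hint else None
    if hint is None:
        sql = ("SELECT n.name FROM nodes n "
               "WHERE regexp_demo(?, n.name) ORDER BY n.id")
        params = (pattern,)
    else:
        sql = ("SELECT n.name FROM nodes n "
               "WHERE n.name LIKE ? AND regexp_demo(?, n.name) ORDER BY n.id")
        params = (hint, pattern)
    names = [row[0] for row in conn.execute(sql, params)]
    return names, stats["regex_calls"]


full, full_calls = search(".*Controller.*", use_hint=False)
hinted, hinted_calls = search(".*Controller.*", use_hint=True)
print(len(full), full_calls, len(hinted), hinted_calls)
```

Returning None for anything beyond literals and `.*` keeps the sketch safe: a missing hint only costs speed, whereas a too-narrow hint would silently drop results.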

Fix 3 — count query stripped of per-row edge subqueries
For the common no-degree-filter path, the count SQL is now SELECT COUNT(*) FROM nodes n WHERE <same WHERE> — no correlated edges subqueries. The degree-filter path retains the wrapped form since it needs those columns for the filter.
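
A small Python sketch of the two count shapes described in Fix 3, under an assumed schema: the nodes and edges tables, their column names, and the in_deg / out_deg subqueries are illustrative guesses, not the project's actual SQL. It shows that the cheap count returns the same total as the wrapped form when no degree filter is present, and why the wrapped form is still needed when the filter refers to the degree columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (source_id INTEGER, target_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO nodes (id, name) VALUES (?, ?)",
    [
        (i, f"Thing{i}Controller" if i % 4 == 0 else f"helper{i}")
        for i in range(1, 201)
    ],
)
conn.executemany(
    "INSERT INTO edges (source_id, target_id) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 200)],
)

where = "n.name LIKE ?"
params = ("%Controller%",)

# Main SELECT: two correlated edge-count subqueries per row.
select_sql = f"""
    SELECT n.id, n.name,
           (SELECT COUNT(*) FROM edges e WHERE e.target_id = n.id) AS in_deg,
           (SELECT COUNT(*) FROM edges e WHERE e.source_id = n.id) AS out_deg
    FROM nodes n
    WHERE {where}
"""

# Old count shape: wraps the full SELECT, degree subqueries included.
wrapped_count_sql = f"SELECT COUNT(*) FROM ({select_sql})"
# New count shape for the no-degree-filter path: same WHERE, no subqueries.
cheap_count_sql = f"SELECT COUNT(*) FROM nodes n WHERE {where}"
# Degree-filter path: the filter needs in_deg, so the wrapped form stays.
degree_count_sql = f"SELECT COUNT(*) FROM ({select_sql}) WHERE in_deg >= ?"

wrapped_total = conn.execute(wrapped_count_sql, params).fetchone()[0]
cheap_total = conn.execute(cheap_count_sql, params).fetchone()[0]
degree_total = conn.execute(degree_count_sql, params + (1,)).fetchone()[0]
print(wrapped_total, cheap_total, degree_total)
```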

Benchmark

Tested on a large PHP codebase (~200K nodes):

Query	Before	After	Speedup
`name_pattern=.*Controller.*`	3099ms	508ms	6×
`name_pattern=.*Service.*`	2006ms	506ms	4×
`name_pattern=.*Repository.*`	2006ms	508ms	4×
`name_pattern=specificFunctionName`	1506ms	507ms	3×
`label=Method` + `name_pattern=.*get.*`	8509ms	509ms	17×

The ~500ms floor is cold-start I/O when spawning a fresh process against a ~500MB database. In the long-running MCP server (warm file cache) the query time is sub-millisecond.

A reusable benchmark script is included at scripts/benchmark-search-graph.sh.

Tests

All store search tests pass including store_search_pagination (offset-past-end total count), store_search_degree_filter, and the full store_extract_like_hints suite.

…ter, cheap count

Three compounding bugs caused 1.5–8.5s latency on name_pattern= searches against large projects (216K nodes), now reduced to ~0ms query time (cold-start dominates):

Fix 1 — regex compiled once per statement, not once per row
sqlite_regexp / sqlite_iregexp now use sqlite3_get_auxdata / sqlite3_set_auxdata to cache the compiled cbm_regex_t for the lifetime of the statement. Previously cbm_regcomp + cbm_regfree ran for every row scanned.

Fix 2 — LIKE pre-filter cuts rows reaching the regex
Wire cbm_extract_like_hints (already implemented but dead) into search_where_basic via a new where_add_like_hints helper. For .*Controller.* this prepends n.name LIKE '%Controller%', letting the idx_nodes_name index satisfy the LIKE clause first and passing only matching rows to iregexp(). Added search_like_pool_t to manage the malloc'd LIKE strings across both statement executions. ST_SEARCH_MAX_BINDS raised 16 → 32 to accommodate extra bind slots.

Fix 3 — count query no longer runs per-row edge subqueries
The count SQL previously wrapped the full SELECT (which includes two correlated subqueries for in_deg / out_deg) in SELECT COUNT(*) FROM (...), executing those edge counts for every matching row even though the count needs none of that. Non-degree-filter path now uses SELECT COUNT(*) FROM nodes n WHERE <same WHERE>, which has no per-row subqueries. Degree-filter path retains the wrapped form since it needs those columns for the filter.

Benchmark on home-ubuntu-dev-sis (216K nodes, 509MB DB):

Query	BEFORE	AFTER	speedup
name_pattern=.*Controller.*	3099ms	508ms	6×
name_pattern=.*Service.*	2006ms	506ms	4×
name_pattern=.*Repository.*	2006ms	508ms	4×
name_pattern=specificFuncName	1506ms	507ms	3×
label=Method + name_pattern=.*get.*	8509ms	509ms	17×
name_pattern=.*Approve.*	1506ms	507ms	3×
name_pattern=.*authorize.*	1506ms	509ms	3×

The ~500ms floor is cold-start I/O (opening a 509MB file from disk). In the long-running MCP server process the warm-cache query time is sub-millisecond.

All store search tests pass including pagination, degree filter, and extract_like_hints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Make project a required CLI argument instead of a hardcoded name, and remove internal query strings used during development testing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Flat BM25 queries of the form:

  SELECT ... FROM nodes_fts JOIN nodes WHERE MATCH ? AND project=? ORDER BY bm25() LIMIT N

block FTS5 WAND/MaxScore early-exit — the outer JOIN+WHERE is invisible to the FTS5 planner, so it scores every matching document before any filter fires. On a large codebase with 100K+ matches this causes 2–16 minute queries.

Fix: two-step subquery. The inner FTS5-only query:

  SELECT rowid, bm25(nodes_fts) FROM nodes_fts WHERE MATCH ? ORDER BY bm25() LIMIT 2000

can early-terminate because no outer predicate blocks it. The outer query then joins and filters at most BM25_INNER_LIMIT (2000) candidates. The count query uses the identical inner-limit subquery, so it benefits too.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
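
The two-step shape described in the commit message above can be sketched in Python as follows. This assumes a Python build whose bundled SQLite includes FTS5. The table definitions, the sample rows, and the exact outer query are illustrative assumptions; only the inner-limit idea and the BM25_INNER_LIMIT name come from the commit message.

```python
import sqlite3

BM25_INNER_LIMIT = 2000  # name and value taken from the commit message

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, project TEXT, name TEXT)"
)
conn.execute("CREATE VIRTUAL TABLE nodes_fts USING fts5(name)")

rows = [
    (i, "demo" if i % 2 == 0 else "other", f"user controller {i}")
    for i in range(1, 101)
]
conn.executemany(
    "INSERT INTO nodes (id, project, name) VALUES (?, ?, ?)", rows
)
conn.executemany(
    "INSERT INTO nodes_fts (rowid, name) VALUES (?, ?)",
    [(node_id, name) for node_id, _project, name in rows],
)

# Step 1 (inner): FTS5-only ranking, capped before any join or filter.
# Step 2 (outer): join and filter only the capped candidate set.
two_step_sql = """
    SELECT n.id, n.name
    FROM (
        SELECT rowid, bm25(nodes_fts) AS score
        FROM nodes_fts
        WHERE nodes_fts MATCH ?
        ORDER BY bm25(nodes_fts)
        LIMIT ?
    ) AS hits
    JOIN nodes n ON n.id = hits.rowid
    WHERE n.project = ?
    ORDER BY hits.score
    LIMIT ?
"""

results = conn.execute(
    two_step_sql, ("controller", BM25_INNER_LIMIT, "demo", 10)
).fetchall()

# With a tiny inner limit the project filter sees at most that many rows.
capped = conn.execute(two_step_sql, ("controller", 5, "demo", 10)).fetchall()
print(len(results), len(capped))
```

The second call shows the trade-off of this design: the project filter runs after the cap, so a small inner limit can leave fewer rows than the outer LIMIT asks for.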

DeusData · 2026-05-10T17:51:28Z

Merged via rebase, thanks @awconstable — diagnoses are spot-on, fixes are clean, benchmarks reproduce. The auxdata caching is the canonical SQLite pattern, the LIKE pre-filter wiring is well-scoped (search_like_pool_t correctly handles the SQLITE_STATIC bind lifetime across both count and main statements), and the count-query split between the no-degree-filter and degree-filter paths is exactly right.

A note for anyone reading this thread later: the branch also contained the FTS5 two-step subquery fix that #302 was targeting, so #302 is now superseded — closing it as resolved.

Soft behavior note worth flagging on the FTS5 path: BM25_INNER_LIMIT = 2000 caps the inner candidate set, so the total reported to callers is now bounded by 2000 (or fewer post-filter). That makes pagination beyond offset 2000 silently saturate. Practically fine for ranked search — page 100 of search results was never going to be useful — but if anyone hits it later, the constant is the place to lift.
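
The saturation described above comes down to simple arithmetic, sketched here in Python. The paginate helper is hypothetical and ignores the post-filter step (which can only shrink the numbers further); it only models the cap that BM25_INNER_LIMIT places on the candidate set.

```python
BM25_INNER_LIMIT = 2000


def paginate(total_matches, offset, limit, inner_limit=BM25_INNER_LIMIT):
    """Return (reported_total, rows_on_page) under the inner-limit cap."""
    candidates = min(total_matches, inner_limit)
    start = min(offset, candidates)
    end = min(offset + limit, candidates)
    return candidates, end - start


# 100K FTS matches: the reported total is capped at the inner limit.
last_partial_page = paginate(100_000, offset=1990, limit=20)
past_the_cap = paginate(100_000, offset=2000, limit=20)
# A small result set is unaffected by the cap.
small_result = paginate(500, offset=0, limit=20)
print(last_partial_page, past_the_cap, small_result)
```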

awconstable and others added 3 commits April 30, 2026 06:38

Remove internal project references from benchmark script

4b2b052


DeusData added the labels bug (Something isn't working) and stability/performance (Server crashes, OOM, hangs, high CPU/memory) on May 4, 2026

DeusData merged commit 5f19454 into DeusData:main May 10, 2026

DeusData mentioned this pull request May 10, 2026

Fix search_graph query= multi-minute latency: two-step FTS5 subquery #302

Closed

DeusData merged 3 commits into DeusData:main from arbor-education:fix/254-search-graph-name-pattern-performance
